The Exploration-Exploitation Tradeoff in Sequential Decision Making Problems
Abstract
Sequential decision-making problems often require an agent to act in an environment where data are noisy or only partially observed. The agent must learn how different actions relate to different rewards, and must therefore balance exploration and exploitation in an effective strategy. In this report, sequential decision-making problems are considered through extensions of the multi-armed bandit framework. First, the bandit problem is extended to a Multi-Agent System (MAS), in which agents control individual arms but can communicate potentially useful information with each other. This framework allows for a better understanding of the exploration-exploitation tradeoff in scenarios where multiple agents interact in a noisy environment. To this end, we present a novel strategy for action and communication decisions, and we demonstrate the benefits of such a strategy empirically. This motivates a theoretical analysis of one-armed bandit problems, developing ideas of how different strategies are optimally tuned. Specifically, the expected rewards of ε-greedy strategies are derived, together with proofs governing their optimal tuning.
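The ε-greedy strategies analysed in the report can be illustrated with a minimal sketch on a Bernoulli multi-armed bandit. The arm success rates, ε value, and horizon below are illustrative assumptions, not the report's experimental setup:

```python
import random

def epsilon_greedy(true_means, epsilon=0.1, horizon=1000, seed=0):
    """Run an epsilon-greedy agent on a Bernoulli multi-armed bandit.

    With probability epsilon the agent explores (pulls a random arm);
    otherwise it exploits the arm with the highest empirical mean.
    true_means are the (unknown to the agent) success probabilities.
    """
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k          # number of pulls per arm
    estimates = [0.0] * k     # empirical mean reward per arm
    total_reward = 0.0
    for _ in range(horizon):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                           # explore
        else:
            arm = max(range(k), key=lambda a: estimates[a])  # exploit
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # incremental update of the empirical mean for the pulled arm
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return total_reward, estimates

# Illustrative two-armed bandit with success rates 0.3 and 0.7
total, est = epsilon_greedy([0.3, 0.7], epsilon=0.1, horizon=5000)
```

Lowering ε reduces the exploration cost once the best arm is found, but raises the risk of never identifying it, which is exactly the tuning tradeoff the report's derivations address.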
Related resources
Exploration-Free Policies in Dynamic Pricing and Online Decision-Making
The growing availability of data has enabled practitioners to tailor decisions at the individual level. This involves learning a model of decision outcomes conditional on individual-specific covariates or features. Recently, contextual bandits have been introduced as a framework to study these online and sequential decision-making problems. This literature predominantly focuses on algorithms that ba...
University of Alberta: New Representations and Approximations for Sequential Decision Making Under Uncertainty
This dissertation research addresses the challenge of scaling up algorithms for sequential decision making under uncertainty. In my dissertation, I developed new approximation strategies for planning and learning in the presence of uncertainty while maintaining useful theoretical properties that allow larger problems to be tackled than is practical with exact methods. In particular, my research...
Psychological Models of Human and Optimal Performance in Bandit Problems (Action Editor: Andrew Howes)
In bandit problems, a decision-maker must choose between a set of alternatives, each of which has a fixed but unknown rate of reward, to maximize their total number of rewards over a sequence of trials. Performing well in these problems requires balancing the need to search for highly-rewarding alternatives, with the need to capitalize on those alternatives already known to be reasonably good. ...
Correlational Dueling Bandits with Application to Clinical Treatment in Large Decision Spaces
We consider sequential decision making under uncertainty, where the goal is to optimize over a large decision space using noisy comparative feedback. This problem can be formulated as a K-armed Dueling Bandits problem, where K is the total number of decisions. When K is very large, existing dueling bandits algorithms suffer huge cumulative regret before converging on the optimal arm. This paper s...
Journal:
Volume/Issue:
Pages: -
Publication date: 2009